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ABSTRACT 


For two independent groups, let Mj(X.) be some conditional measure of location for 
the jth group associated with some random variable Y given X = {Xi,X2). Let = 
{Xi,... ,Xi^} be a set of K points to be determined. An extant technique can be used to 
test Hq. Mi(X) = M2(X) for each X G without making any parametric assumption about 
Mj(X). But there are two general reasons to suspect that the method can have relatively 
low power. The paper reports simulation results on an alternative approach that is designed 
to test the global hypothesis Hq: Mi(X) = M2(X) for all X G fl. The main result is that 
the new method offers a distinct power advantage. Using data from the Well Elderly 2 
study, it is illustrated that the alternative method can make a practical difference in terms 
of detecting a difference between two groups. 

Keywords: ANCOVA, trimmed mean, smoothers. Well Elderly 2 study 


1 Introduction 

Eor two independent groups, consider the situation where for the jth group (j = 1, 2) Yj 
is some outcome variable of interest and X = {Xi,X 2 ) is a vector of two covariates. Let 
Mj(X.) be some conditional robust measure of location associated with Y given X. A basic 
and well known goal is determining whether the groups differ in terms of Mi(X) and M2(X). 
The classic ANCOVA (analysis of covariance) method assumes that 

Yj = /3oj + /3iXij + /32X2j + e, (1) 

where /3oj, /di and /d 2 are unknown parameters estimated via least squares regression and e 
is a random variable having a normal distribution with mean zero and unknown variance 
(T^. So the regression planes are assumed to be parallel and the goal is to compare the 
intercepts. It is well known, however, that there are serious concerns with this approach. 
First, there is a vast literature establishing that methods based on means, and more broadly 
least squares regression, are not robust (e.g., Staudte and Sheather, 1990; Marrona et al., 
2006; Heritier et ah, 2007; Hampel et al., 1986; Huber and Ronchetti, 2009; Wilcox, 2012). 
A practical consequence is that power can be relatively low even under a small departure 
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from normality. Moreover, even a single outlier can yield a poor fit to the bulk of the points 
when using least squares regression. Another concern is that two types of homoscedasticity 
are assumed. The hrst is that for each group, the variance of the error term does not depend 
on the value of the covariate. If this assumption is violated the wrong standard error is 
being used. The second is that the variance of the error term is the same for both groups. 
Violating these assumptions can result in poor control over the Type I error probability. Yet 
another fundamental concern with Q is that the true regression surfaces are assumed to 
be planes. Certainly, in some situations, this is a reasonable approximation. When there 
is curvature, using some obvious parametric regression model might suffice. (For example, 
include a quadratic term.) But it is known that this approach can be inadequate, which has 
led to a substantial collection of nonparametric regression methods, often called smoothers, 
for dealing with curvature in a more flexible manner (e.g., Hardle, 1990; Efromovich, 1999; 
Eubank , 1999; Fox, 2001; Gyorfi, et ah, 2002). Yet another concern is the assumption that 
the regression surfaces are parallel. One could test the assumption that the slope parameters 
are equal, but it is unclear when such a test has enough power to detection situations where 
this assumption is violated to the point that it makes a practical difference. 

Here, the model given by ([^ is replaced with the less restrictive model 

Y, = M,(X) + e„ (2) 

where Mj(X.) is some unknown function that reflects some conditional robust measure of 
location associated with Y given X. The random variable ej has some unknown distribution 
with variance ct|. So unlike the classic approach where it is assumed that 

M,(X) =/?o,+/di,Wi + /32,Y2, 

no parametric model for Mj(X) is specified and af = is not assumed. In particular it 
is not assumed that the regression surfaces are parallel. The goal here is to test the global 
hypothesis 

Ho : Mi(X) = M2(X), VX G {Xi,..., X;^}, (3) 

where Xi,..., X^^: are K vectors chosen empirically in a manner to be determined. 

For the case of a single covariate, Wilcox (2012, section 11.11.1) describes a method that 
tests Hq-. Mi(Yfc) = M 2 {Xk) for each k, k = 1,... , K. Roughly, for each identify values 
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of the covariate that are close to Xk and then compare the groups based on the corresponding 
Y values using a method based on a robust measure of location. For this special case, it is a 
relatively simple matter to choose values for the covariate in a manner that is likely to hnd 
any differences that might exist. 

When dealing with two covariates, Wilcox (2012) suggests a simple extension where the 
values of the covariate are chosen based on how deeply they are nested within the cloud of 
covariate values. (This is method Ml in section 2.1 of this paper.) The K points are chosen 
to include the point in the hrst group having the deepest half space depth plus the points on 
the .5 depth contour. (More precise details are given in section 2.) This typically results in a 
relatively small number of covariate values where the corresponding Y values are compared 
based on a robust measure of location. Again K tests are performed and the probability 
of one or more Type I errors can be controlled using some improvement on the Bonferroni 
method (e.g., Rom, 1990; Hochberg, 1988). But it is not al all clear when this relatively 
simple approach will choose covariate values that are likely to detect true differences between 
the groups. A way of dealing with this issue is to select a larger collection of covariate values, 
but if the familywise error rate (the probability of one or more Type I errors) is controlled, 
power can be relatively poor due to the large number of hypotheses that are tested. Switching 
to a method that controls the false discovery rate when dealing with dependent test statistics 
(e.g., Benjamini &Yekutieh, 2001) would suffer from the same concern. So the focus here is 
on testing (|^ using a specihed proportion of the deepest covariate values within the cloud 
of covariate values that are available. 

The paper is organized as follows. Section 2 reviews the method in Wilcox (2012) followed 
by a description of an alternative method aimed at testing (|^. Two variations of the 
alternative method are compared via simulations in Section 3 in terms of both power and 
their ability to control the Type I error probability. The power of both variations is compared 
to the power of the method in Wilcox (2012). Section 4 uses data from the Well Elderly 2 
study to illustrate that the new method can make a practical difference. 
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2 Description of the Methods 


The methods compared here are based in part on a method derived by Yuen (1974) for 
comparing the population trimmed means of two independent groups. To describe it, mo¬ 
mentarily ignore the covariates and consider the goal of testing 

Hq : Uti = f^t2, (4) 


the hypothesis that two independent groups have equal population trimmed means. For the 
jth group (j = 1 , 2 ), let rij denote the sample size and let Y(i)j < ... < denote the 

Yij values written in ascending order. For some 0 < 7 < .5, the 7 -trimmed mean for the jth 
group is 




1 

n-2q- ^ 

U i=g^+l 


where gj = Yirij] is the greatest integer less than or equal to Here the focus is on 7 = .2, 
a 20% trimmed mean. Under normality, this choice has good efficiency relative to the sample 
mean (Rosenberger & Gakso, 1983). Moreover, the sample 20% trimmed mean enjoys certain 
theoretical advantages. First, it has a reasonably high breakdown point, which refers to the 
proportion of values that must be altered to destroy it. Asymptotic results and simulations 
indicate that it reduces substantially concerns about the impact of skewed distributions on 
the probability of a Type I error. This is not to suggest that 20% trimming is always the 
optimal choice: clearly this is not the case. The only suggestion is that it is a reasonable 
choice among the many robust estimators that might be used. 


Winsorizing the Yij values refers to setting 



if Yij < Y(^g.j^i)j 


Y ■ 


(5) 


if Yij > Y(^n-g)j- 



The Winsorized sample mean corresponding to group j is 

Hj 

and the Winsorized variance is 
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Let hj = Tij — 2gj. 
trimming. Let 


That is, hj is the number of observations left in the jth group after 


, _ (% - 

h^{hj-l)' 


( 6 ) 


Yuen’s test statistic is 


\/di -\- d2 


(7) 


The null distribution is taken to be a Student’s t distribution with degrees of freedom 


Vy = 


{di + d2Y 

d-l 


hi — 1 


h2 — l 


2.1 Method Ml 

Method Ml is described in Wilcox (2012, section 11.11.3). A complete description of the 
many computational details is not provided here, but an outline of the method is provided 
with the goal of explaining how it differs from method M2 in the next section. 

Let Xjj {i = l,...nj; j = 1, 2) denote the rij covariate points corresponding to the 
jth group. Momentarily consider a single covariate point, X. Method Ml estimates Mj(X.) 
using the Yij such that the corresponding X^ values are close to X. More precisely, X^ 
{i = j = 1, 2)for the jth group, compute a robust covariance matrix based on 

Xy {i = 1,... ,nj). There are many ways of computing a robust covariance matrix with no 
single estimator dominating. Here a skipped covariance matrix is used, which is computed 
as follows. For fixed j, outliers among the X^ values are identihed using a projection-type 
multivariate outlier detection technique (e.g., Wilcox, 2012, section 6.4.9). These outliers 
are removed and the usual covariance matrix is computed using the remaining data. 

Next, compute robust Mahalanobis distances for each covariate point based on the robust 
covariance matrix just described, with X taken to be the center of the data. The point X^- 
is said to be close to X if its robust Mahalanobis distance is small, say less than or equal to 
/, which is called the span. Generally / = .8 performs reasonably well when the goal is to 
approximate the regression surface. Of course exceptions are encountered, but henceforth 
/ = .8 is assumed. Let T’j(x) be the subset of {1, 2, ..., nj} that indexes the Xjj values such 
that the Mahalanobis distance associated with X^- is less than or equal to /. Let Nj(X.) be 
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the cardinality of the set -Pj(X) and let Mj(X) denote the 20% trimmed mean based on the 
Yij valnes for which i G -Pj (X). Then for the single point X, (|^ can be tested by applying 
Yuen’s method with the valnes for which i G Pj{X) provided both Ni{X) and N 2 {X) 
are not too small. Following Wilcox (2012), this is taken to mean that Yuen’s method can 
be applied if simnltaneously iVi(X) > 12 and N 2 {X) > 12, in which case the two gronps are 
said to be comparable at X. 

Now consider the issne of choosing covariate valnes where the regression snrfaces will be 
compared. For hxed j, compute how deeply each Xjj is nested within the cloud of points Xjj 
(z = 1,... ,nj). This is done with a projection type method that is similar to an approach 
discussed by Donoho and Gasko (1992). Computational details are described in section 2.2. 
Consider the deepest point as well as those on the polygon containing the central half of 
the data. (Lin et al., 1999, call this polygon the .5 depth contonr.) Method Ml applies 
Yuen’s method at each of these points provided the regression snrfaces are comparable at 
these points as previously dehned. The probability of one or more Type I errors is controlled 
using the method in Hochberg (1988). 

2.2 Method M2 

There are several positive features of method Ml but some negative features as well. First, 
Yuen’s method for comparing trimmed means has been studied extensively and appears to 
perform relatively well in terms of both Type I errors and power. The method for choosing 
the covariate valnes seems reasonable in the sense that it uses points that are nested deeply 
within the clond of covariate points, which reflect sitnations where the regression snrfaces 
are comparable. Ronghly, deeply nested points correspond to situations where the regression 
snrfaces can be estimated in a relatively accurate manner. If a point X is not deeply nested 
in the cloud of covariate values, hnding a sufficiently large number of other points that are 
close to X might be impossible. 

Bnt a concern abont Ml is that perhaps trne differences might be missed because of 
the relatively small number of covariate values that are used. A way of dealing with this 
possibility is to use all of the covariate points that are deeply nested in the cloud of all 
covariate points and then test the global hypothesis given by ([^. This is the strategy 
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behind method M2. 


Method M2 begins by computing the projection depth (e.g., Wilcox, 2012, section 6.2.5) 
for each Xji (the ith covariate vector in group 1) in the same manner as method Ml. To 
describe the computational details, momentarily focus on a single n x p matrix of data, Z. 
Let r be some robust measure of location based on Z. For simplicity, the marginal medians 
(based on the usual sample median) are used. Let 

Ui = Zj - f 


(i = l,...,n). 

For any j {j = 1,... ,n), let 


a = 


^ij ^ Uj , 
k=l 

T., = ^(17a,...,17,p) 


( 8 ) 


and 


Du — \\Tij\\, 

where \\Tij\\ is the Euclidean norm associated with the vector (i = 1,... n; j = 1,..., n). 
Let 


D, 


ii 


dij 

qi2 - Qii 

where qi 2 and qu are estimates of the upper and lower quartiles, respectively, based on 
Du,..., Din- (Here, qi 2 and qa are based on the so-called ideal fourths; see Friqqe et ah, 
1989.) The projection distance of Zj, the jth row of Z, relative to the cloud of points 
represented by Z, is the maximum value of d^j, say pd{Zj), the maximum being taken over 
i = 1,... ,n (cf. Donoho & Gasko, 1992). Following Liu et ah (1999), the depth of Zj is 
taken to be 


Let the set {Xi,...,X 7 f} indicate the deepest half of the points in the hrst group. 
Points where the regression surfaces are not comparable (i.e., Ni(X.) < 12 or N 2 (X.) < 12) 
are discarded. Because K can be relatively large, controlling FWE via Hochberg’s method 
seems likely to have relatively low power, which is verified in the simulations in section 4. 



The reason for choosing the deepest half, rather than some larger proportion, is based 
on preliminary simulations. Using the deepest half, typically the regression surfaces are 
comparable at all K points when the sample sizes for both groups are greater than or equal 
to 50. For a larger proportion of points, this is often not the case. There are, of course, 
many other variations. Some other measure of the depth might be used or one could use 
all of the covariate points where the regression surfaces are comparable. The goal here is 
to hnd at least one variation that controls the Type I error probability reasonably well and 
simultaneously offers a power advantage over method Ml. 

Method M2 begins in the same manner as method Ml: test Hq\ Mi(X) = M 2 (X) for 
each X G {Xi,..., X/^}. Label the resulting p-values pi,... ^px- The idea is to test (|^ 
using some function of these K p-values. Perhaps the best-known method for testing some 
global hypothesis based on p-values is a technique derived by Fisher (1932). But Zaykin et 
al. (2002) note that the ordinary Fisher product test loses power in cases where there are a 
few large p-values. They suggest using instead a truncated product method (TPM), which 
is based on the test statistic 

w=n pF-"’ 

k=l 

where I is the indicator function. Setting r = 1 yields Fisher’s method, but Zaykin et 
al. suggest using r = .05. Zaykin et al. derive the null distribution of W when all K 
tests are independent. But the K tests performed here are not independent simply because 
Fj(Xfc) n Pj(X£), k ^ I, is not empty. If this dependence among the tests is ignored when 
computing a critical value for IF, control over the Type I error probability is poor. For the 
dependent case, Zaykin et al. suggest using a bootstrap method, but this results in relatively 
high execution time for the situation at hand making this approach difficult to study via 
simulations. Consequently, an alternative approach was used: Proceed as done by Cosset 
in his derivation of Student’s t and assume normality with the goal of determining the a 
quantile of IF, say tc, in which case (|^ is rejected at the a level if IF < w. Here, the 
critical value w was determined via simulations using (2) with Mj(X.) = 0 and ej having a 
standard normal distribution. More precisely, for each j, j) j = 1, 2) 

were generated from a trivariate normal distribution where all correlations are zero. Then 
IF was computed and this process is repeated say B times yielding IFi ,... ,Wb- Put these 
B values in ascending order yielding IF(i) < ... < IF(b). Then w was estimated to be IF(fc), 
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where k is aB rounded to the nearest integer. Here, B = 4000 was used. 

One of many alternative methods is to use instead the test statistic 

Unexpectedly, this alternative test statistic performed relatively well, in terms of power, 
under a shift in location model, as illustrated in section 4. Now reject (|^ if Q < Qa, the a 
quantile of Q, which again is determined via simulations in the same manner as the critical 
value w. 


3 Simulation Results 


As is evident, a basic issue is the impact on the Type I error probability when dealing with 
non-normal distributions as well as situations where there is an association with the covariate 
variables. Simulations were used to address this issue with ui = 77-2 = 50. Smaller sample 
sizes, such as rii = 772 = 30, routinely result in situations where no covariate values can 
be found where comparisons can be made. That is, iVi(X) < 12 or A" 2 (X) < 12 for all 
XG{Xi,...,X^}. 


Estimated Type I error probabilities, d, were based on 4000 replications. Four types of 
distributions were used: normal, symmetric and heavy-tailed, asymmetric and light-tailed, 
and asymmetric and heavy-tailed. More precisely, values for the error term, ej in ([^ were 
generated from one of four g-and-h distributions (Hoaglin, 1985) that contain the standard 
normal distribution as a special case. If Z has a standard normal distribution, then by 
dehnition 


U = 


^xpy)-i exp(hZ 2 / 2 ), if ^ > 0 
Zexp(hZV 2 ), if ^ = 0 

has a g-and-h distribution where g and h are parameters that determine the hrst four mo¬ 
ments. The four distributions used here were the standard normal ( 5 ^ = h = 0), a symmetric 
heavy-tailed distribution {h = 0 . 2 , g = 0 . 0 ), an asymmetric distribution with relatively light 
tails {h = 0 . 0 , g = 0 . 2 ), and an asymmetric distribution with heavy tails {g = h = 0 . 2 ). 
Table 1 shows the skewness (ki) and kurtosis (^ 2 ) for each distribution. Additional prop- 
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Table 1: Some properties of the g-and-h distribution. 


g 

h 


1^2 

0.0 

0.0 

0.00 

3.0 

0.0 

0.2 

0.00 

21.46 

0.2 

0.0 

0.61 

3.68 

0.2 

0.2 

2.81 

155.98 


erties of the g-and-h distribution are summarized by Hoaglin (1985). The Xjj values were 
generated from a bivariate normal distribution with correlation equal to zero. 

Three types of associations were considered. The hrst two deal with situations where 
Yij = (3Xij + e. The two choices for the slope were /? = 0 and 1. The third type was 
Yij = Xfj + e. These three situations are labeled SI, S2 and S3, respectively. Additional 
simulations were run where the correlation between the two covariates is .5. But this had 
almost no impact on the results, so for brevity they are not reported. 

Estimated Type I error probabilities are reported in Table 2. Although the seriousness 
of a Type I error depends on the situation, Bradley (1978) suggests that as a general guide, 
when testing at the .05 level, the actual level should be between .025 and .075. Based on this 
criterion, both TPM and the method based on Q provide adequate control over the Type I 
error probability. A possible appeal of TPM is that when testing at the .05 level, the actual 
level was estimated to be less than or equal to .050 among all of the situations considered. 
As for the method based on Q, the estimate exceeds .05 in some situations, particularly 
when dealing with heavy-tailed distributions {h = .2), the largest estimate being .069. 

Table 3 shows the estimated power when for the hrst group, (|^ is replaced by Yi = 
Mi(X) -|- ei -|- .5. As is evident, method M2 based on Q has the highest power among all 
of the situations considered and method Ml has the lowest power. For some situations, 
the higher power using Q, rather than TPM, is presumably due in part to a lower Type I 
error probability associated with TPM. Note, however, that even in situations where both 
methods have similar Type I error probabilities, Q has a higher estimated power. 

Some additional simulations were run where for the hrst group, Yj = Mj(X.)+ej + .5Ixi>o- 
The idea was that the two versions of method M2 are a function of the pattern of the 
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Table 2: Estimated Type I error probabilities when testing at the a = .05 level, ni = n 2 = 50 


9 

h 

s 

Q 

TPM 

0.0 

0.0 

1 

.050 

.050 

0.0 

0.0 

2 

.036 

.042 

0.0 

0.0 

3 

.048 

.049 

0.0 

0.2 

1 

.061 

.046 

0.0 

0.2 

2 

.050 

.043 

0.0 

0.2 

3 

.064 

.048 

0.2 

0.0 

1 

.055 

.046 

0.2 

0.0 

2 

.042 

.038 

0.2 

0.0 

3 

.052 

.046 

0.2 

0.2 

1 

.064 

.047 

0.2 

0.2 

2 

.053 

.042 

0.2 

0.2 

3 

.069 

.048 


Table 3: Estimated power, ni = n 2 = 50 
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h 

s 

Q 

TPM 

Ml 

0.0 

0.0 

1 

.409 

.345 

.318 

0.0 

0.0 

2 

.332 

.301 

.262 

0.0 

0.0 

3 

.341 

.307 

.270 

0.0 

0.2 

1 

.387 

.290 

.283 

0.0 

0.2 

2 

.299 

.245 

.229 

0.0 

0.2 

3 

.315 

.255 

.239 

0.2 

0.0 

1 

.410 

.327 

.324 

0.2 

0.0 

2 

.327 

.303 

.276 

0.2 

0.0 

3 

.342 

.287 

.270 

0.2 

0.2 

1 

.388 

.286 

.284 

0.2 

0.2 

2 

.299 

.243 

.231 

0.2 

0.2 

3 

.318 

.247 

.232 
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individual p-values and that perhaps a situation where a difference between the two regression 
surfaces exists only for a subset of the covariate values might result in TPM having higher 
power than Q. But again, Q had higher power than TPM. However, results in section 4 
indicate that in practice, Q does not dominate TPM in terms of power. 


4 Illustrations 

Data from the Well Elderly 2 study (Clark et ah, 2011; Jackson et ah, 2009) are used to 
illustrate that the choice of method can make a practical difference. A general goal in the 
Well Elderly 2 study was to assess the efficacy of an intervention strategy aimed at improv¬ 
ing the physical and emotional health of older adults. A portion of the study was aimed 
at understanding the impact of intervention on a measure of meaningful activities which 
was measured with the Meaningful Activity Participation Assessment (MAPA) instrument 
(Eakman et ah, 2010). Two covariates are used here. The first is a measure of depressive 
symptoms based on the Center for Epidemiologic Studies Depressive Scale (CESD). The 
CESD (Radloff, 1977) is sensitive to change in depressive status over time and has been 
successfully used to assess ethnically diverse older people (Lewinsohn et ah, 1988; Foley et 
ah, 2002). Higher scores indicate a higher level of depressive symptoms. 

The other covariate was the cortisol awakening response (CAR). Saliva samples were 
taken at four times over the course of a single day: on rising, 30-60 minutes after rising, 
but before taking anything by mouth, before lunch, and before dinner. Then samples were 
assayed for cortisol. Extant studies (e.g.. Clow et ah, 2004; Chida & Steptoe, 2009) indicate 
that measures of stress are associated with the cortisol awakening response (CAR), which is 
dehned as the change in cortisol concentration that occurs during the hrst hour after waking 
from sleep. (CAR is taken to be the cortisol level upon awakening minus the level of cortisol 
after the participants were awake for about an hour.) The sample size for the control group 
was 187 and the sample size for the group that received intervention was 228. Based on 
method Ml, no significant differences were found with the familywise error rate set at .05. 
In contrast, method M2 based on Q rejects (the p-value is .008) and the TPM version of 
method M2 rejects as well (the p-value is .021). 
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Method M2 indicates that there is a difference between the two groups, but there is the 
issue of where and by how much. A seemingly natural conclusion is that the groups differ at 
the point corresponding to the smallest p-value. Here, the minimum p-value occurs for CAR 
equal to —.218 and CESD equal to 4.00, which correspond to a relatively high increase in 
cortisol after awakening coupled with a low CESD measure of depressive symptoms. Among 
the 74 covariate points that were used, 39% of the p-values are less than or equal to .05. More 
information can be gleaned from a plot of the p-values as well as the estimated difference 
between the regression surfaces where comparisons were made. Figure 1 shows a plot of the 
p-values for the situation at hand, which suggests that the strongest evidence for a signihcant 
difference occurs when CESD is low regardless of what CAR might be. 

Figure 2 shows the estimated difference between the predicted MAPA scores. With one 
exception, all estimated differences are positive indicating that predicted MAPA scores are 
higher among the group receiving intervention. The highest estimated differences occur for 
two subgroups of participants. The first consists of those with a relatively high increase 
in cortisol after awakening coupled with a low CESD measure of depressive symptoms; the 
corresponding p-values are relatively low. The second subgroup consists of those participants 
who have both a relatively high CAR and a relatively high CESD; these points have relatively 
low p-values as well. Among participants who have relatively high depressive symptoms and 
relatively low (negative) CAR values, the difference between predicted MAPA scores is small. 
And as indicated in Figure 1, among these particular participants, highly non-signihcant 
results were obtained. 

Another dependent variable in the Well Elderly study was the RAND 36-item Health 
Survey (SF-36), a measure of self-perceived physical health and mental well-being (Hays, 
1993; McHorney et ah, 1993). Higher scores reflect greater perceived health and well-being. 
Here, the control group and the experimental group are compared based on subset of the SF- 
36 items that reflect perceived physical health, again using CESD and the CAR as covariates. 
The deepest half of the data consisted of 74 covariate points. Despite performing 74 tests, 
method Ml resulted in a signihcant result for 6 of the 74 tests that were performed, again 
with EWE set equal to .05. The TPM version of Method M2 also rejects, but Q does not 
reject; its p-value is .126. 
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Figure 1: The p-values associated with the covariate points where the regression surfaces 
were compared. The plot indicates that the strongest evidence for a signihcant difference 
occurs when CESD is low. 
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Figure 2: The z-axis indicates the estimated difference between the predicted MAPA scores. 
(Positive values indicate higher predicted MAPA scores for participants in the intervention 
group.) Estimated differences are relatively high for two subgroups: when the CAR is 
negative (cortisol increases shortly after awakening) and CESD is relatively low, and when 
both CESD and the CAR are relatively high. 
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5 Concluding Remarks 


In summary, all indications are that both versions of method M2 control the Type I error 
probability reasonably well. An apparent advantage of the TPM version of method M2 is 
that it avoids Type I error probabilities greater than the nominal level. But simulations 
indicate that choice of method can make a practical difference in terms of power, with Q 
seeming to have an advantage. However, the illustrations based on the Well Elderly 2 study 
suggest that Q does not dominate in terms of power. 

In principle, method M2 is readily extended to more than two covariates. But in practice 
this might require a relatively large sample size due to the curse of dimensionality: neighbor¬ 
hoods with a hxed number of points become less local as the dimensions increase (Bellman, 
1961). 

There are many reasonable variations of method M2 and perhaps variations other than 
those studied here often provide a practical advantage. For example, when using TPM, some 
other choice for r might be more optimal in practice. In addition, there are many alternative 
test statistics that might be used that are function of the individual p-values (e.g.. Cousins, 
2008). As is evident, resolving this issue is non-trivial. 

Finally, the R function ancov2COV, which is stored on the author’s web page, performs 
both versions of method M2. 
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